Skip to content

fix: suffixes_prefixes_titles always reflects current set state#166

Open
gaoflow wants to merge 5 commits into
derek73:masterfrom
gaoflow:fix/suffixes-prefixes-titles-stale-cache
Open

fix: suffixes_prefixes_titles always reflects current set state#166
gaoflow wants to merge 5 commits into
derek73:masterfrom
gaoflow:fix/suffixes-prefixes-titles-stale-cache

Conversation

@gaoflow

@gaoflow gaoflow commented Jun 25, 2026

Copy link
Copy Markdown

Problem

Constants.suffixes_prefixes_titles cached its result in _pst after the
first access and never invalidated that cache. Any subsequent add() or
remove() call on titles, prefixes, suffix_acronyms, or
suffix_not_acronyms was silently ignored by the stale cache.

This creates an observable inconsistency: is_title() / is_prefix() /
is_suffix() query the live sets and return the correct answer, but
is_rootname() — which delegates to suffixes_prefixes_titles — keeps
returning the stale cached answer, causing it to contradict the other helpers.

Minimal reproduction:

from nameparser import HumanName
from nameparser.config import Constants

C = Constants()
# Access the property to prime the cache.
_ = C.suffixes_prefixes_titles

C.prefixes.add('xpfx')
hn = HumanName("", C)

# These two should agree — before this fix they don't:
print(hn.is_prefix('xpfx'))   # True  (correct — queries live set)
print(hn.is_rootname('xpfx')) # True  (wrong  — reads stale cache, should be False)

The same inconsistency affects titles and suffixes added or removed after the
property is first read. join_on_conjunctions relies on is_rootname to
count rootname_pieces and choose whether to join single-letter conjunctions,
so a stale _pst can silently skew that decision.

Fix

Drop _pst entirely and compute the set union fresh on every access. The
union of four SetManager instances is a simple O(n) operation with a small
constant and is called only during name parsing, so there is no meaningful
performance cost.

Tests

Five new tests in tests/test_constants.py verify that:

  • suffixes_prefixes_titles reflects titles and prefixes added after
    construction.
  • suffixes_prefixes_titles no longer contains a title that was removed.
  • is_rootname and is_title/is_prefix remain consistent after
    add()/remove() calls.

All 682 tests pass (672 pre-existing + 10 from the dual-run fixture applied
to the 5 new test methods).


This pull request was prepared with the assistance of AI, under my direction and review.

gaoflow and others added 3 commits June 25, 2026 14:04
The `suffixes_prefixes_titles` property on `Constants` cached its result
in `_pst` after the first access.  Any subsequent `add()` or `remove()`
call on `titles`, `prefixes`, `suffix_acronyms`, or `suffix_not_acronyms`
was silently ignored by the cache, so `is_rootname()` kept returning the
stale result.

Concretely, a word added to `C.titles` after the property was first
accessed would still be treated as a rootname (first/middle/last) by
`join_on_conjunctions`, even though `is_title()` correctly returned
`True` for it.  This contradicts the documented behaviour of per-instance
config customisation described in AGENTS.md.

Fix: drop the `_pst` cache entirely and compute the union fresh on every
access.  The four-set union is cheap and the simplest correct approach.

Add five tests that assert the property and `is_rootname` stay consistent
with the live sets after `add()`/`remove()` calls.
…_setattr__

The original cache was dropped to fix staleness, but recomputing the set
union on every call is ~1000x slower. This restores the cache with a
correct invalidation strategy: SetManager fires an on_change callback
after add/remove, and Constants.__setattr__ clears _pst and re-wires
the callback whenever one of the four contributing attrs is replaced
(covering both user mutations and conftest teardown restores).

The conftest manual `CONSTANTS._pst = None` reset is removed — it is
now handled automatically by __setattr__ during collection restore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lper

Add missing test cases flagged during PR review:
- prime-then-mutate test that would catch a regression to cached staleness
- add tests for suffix_acronyms and suffix_not_acronyms (previously untested)
- remove test for prefixes (mirroring the existing titles remove test)
- assertFalse(x in ...) -> assertNotIn for the two remove tests

Also adds assertNotIn to HumanNameTestBase to support the new assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@derek73

derek73 commented Jun 27, 2026

Copy link
Copy Markdown
Owner

Two follow-up commits added based on review feedback:

f69ecee — perf: restore _pst cache with proper invalidation

The original fix dropped the cache entirely, which is ~1000x slower per call (~69 µs vs ~70 ns). This commit restores it with correct invalidation:

  • SetManager now accepts an optional on_change callback, fired after add / remove — so user mutations immediately clear _pst
  • Constants.__setattr__ clears _pst and re-wires the callback whenever one of the four contributing attrs is replaced — this covers the test fixture teardown path where setattr(CONSTANTS, 'prefixes', SetManager(...)) swaps in a fresh instance
  • The manual CONSTANTS._pst = None reset in conftest.py is removed; __setattr__ handles it automatically

5fc4b6b — test: expand coverage and add assertNotIn helper

  • Added a prime-then-mutate test (_ = c.suffixes_prefixes_titles; c.titles.add(...); assertIn(...)) — the exact scenario from the bug report, which none of the original five tests exercised
  • Added add-tests for suffix_acronyms and suffix_not_acronyms (both were untested)
  • Added a remove-test for prefixes (mirroring the existing titles remove test)
  • Replaced assertFalse(x in ...) with assertNotIn (added shim to HumanNameTestBase)

690 tests pass (682 pre-existing + 8 new × dual-run fixture).

derek73 and others added 2 commits June 27, 2026 13:15
Six of the suffixes_prefixes_titles tests passed against the unfixed
code because they read the property only once on a cold cache, so they
asserted "the union includes its source sets" rather than "the cache
stays consistent after a mutation." Prime the cache before mutating so
they actually exercise invalidation; verified they now fail (14 across
the dual-run fixture) against master's pre-fix config.

Change the property guard from `if not self._pst` to `if self._pst is
None` so a legitimately empty union caches instead of recomputing on
every access — and so "cached" means "computed," which is the state the
primed tests rely on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… leak

Replace the catch-all Constants.__setattr__ + _PST_ATTRS frozenset with a
_CachedUnionMember descriptor on the four cached attributes (prefixes,
suffix_acronyms, suffix_not_acronyms, titles). The invalidation/rewiring
behavior now lives beside the attributes it governs, runs only for those
four (not on every attribute assignment), and removes the hand-synced
attribute-name list.

Fixes a stale-callback leak: reassigning a manager now detaches the
replaced one (_on_change = None) so an orphaned reference can no longer
invalidate the live cache. Drop the unused on_change constructor
parameter from SetManager — wiring is always done by the owning Constants
after construction (e.g. the config-teardown setattr path), so the
constructor channel was dead and misleading.

Behavior, mypy, and ruff are unchanged; parse throughput is identical to
the cached baseline (~18us/name vs ~263us/name without the cache).

Tests: add coverage for the replaced-manager rewiring path, the
orphan-no-longer-invalidates guarantee, add_with_encoding invalidation,
and suffix-set removal after priming; add an assertIs shim to the test
base.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@derek73

derek73 commented Jun 27, 2026

Copy link
Copy Markdown
Owner

388c65b — refactor: replace the __setattr__ cache-invalidation with a descriptor

A follow-up on the cache-invalidation machinery added in f69ecee. That commit kept the _pst cache correct after add()/remove() by wiring a SetManager.on_change callback and using a custom Constants.__setattr__ (plus a _PST_ATTRS frozenset) to clear _pst and re-wire the callback whenever one of the four contributing attributes was replaced. It worked, but a closer review surfaced three issues worth tidying before merge:

  1. The __setattr__ was global for a four-attribute concern. It ran on every attribute assignment to Constants (string_format, capitalize_name, …), paying an in-check each time, and the "which attributes are cached" knowledge was duplicated across three places that had to be kept in sync by hand: the _PST_ATTRS frozenset, the attribute annotations, and the union expression in suffixes_prefixes_titles.

  2. A stale-callback leak. Reassigning a manager (c.prefixes = SetManager(...)) wired the new one but never detached the old one. The orphaned manager still held self._invalidate_pst, so mutating that dangling reference would invalidate a cache it was no longer part of — benign (it only nulls the cache, never corrupts it), but a latent foot-gun and an unnecessary recompute.

  3. A dead constructor parameter. SetManager.__init__(on_change=...) was never actually passed — wiring always happened afterward via the external value._on_change = ... poke (including the config-teardown setattr path). The advertised constructor contract didn't match how the callback was really wired.

The fix moves the behavior onto a small _CachedUnionMember descriptor placed on exactly the four cached attributes:

  • Invalidation/re-wiring now lives beside the attributes it governs and fires only for those four, so the global __setattr__ and the _PST_ATTRS frozenset are both gone.
  • On reassignment it detaches the replaced manager (previous._on_change = None), closing the stale-callback leak.
  • The unused on_change constructor parameter is dropped; _on_change is initialized to None and wired by the owning Constants, which is the only path that was ever exercised.

The in-place add()/remove()_on_change callback mechanism on SetManager is unchanged — that's still what catches mutations to an existing manager; only the whole-attribute-replacement path moved from __setattr__ to the descriptor.

Behavior, mypy, and ruff are all unchanged (the descriptor uses @overload for the instance-vs-class __get__ return types so the SetManager type still flows through). Parse throughput is identical to the cached baseline — ~18 µs/name, vs ~263 µs/name with no cache — so the performance reason for keeping _pst still holds.

New tests cover the paths the previous version left unguarded: the replaced-manager re-wiring path, the "orphaned manager no longer invalidates the live cache" guarantee (asserted by cache identity, since the leak's only symptom is an unnecessary recompute), add_with_encoding invalidation, and removal from the two suffix sets after the cache is primed. 700 passed, 20 xfailed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants